Executive Summary

The final inquiry question our team landed on investigates which states are purchasing the most research-related drugs, and whether this correlates with the socio-economic status of each state.

Key Findings¶

Finding 1¶

As the unemployment rate of a state increases, the total amount spent on research-related prescriptions increases.

In [50]:
finding1.show()

At first glance, this finding comes with plenty of outliers. Indeed, were it not for a few outlying states — California, Texas, Florida, New York, Massachusetts and North Carolina — the data would follow a strong positive trend.

These six states remain troublesome for the entirety of our analysis. They are outliers in every sense of the inquiry question: they spend far more than the other states on research-related prescriptions, as the choropleth map and boxplot below make clear.

In [51]:
choro_finding1.show()
box_finding1.show()

Yes, those outliers on the boxplot are the same six troublesome states. Hence, our inquiry question shifted slightly: we already understood that unemployment rate describes the general trend between states and total money spent on research-related drugs, but from this point we began to investigate what makes states such as California and Texas different. Is there a correlation we are missing?

Finding 2¶

As the number of general-care hospitals and military hospitals in a state increases, the amount spent on research-related prescriptions also increases.

In [52]:
general_fig.show()
military_fig.show()

Considering that fewer than 1 in 5 entries in the original dataset were related to a teaching hospital (see pie chart below), the first correlation is less obvious than it may first appear. The increase in research-related payments alongside an increase in general-care hospitals may be attributed to the fact that hospitals tend to be the first to trial new or experimental drugs.

In [53]:
pie_finding2.show()

The second correlation is far from obvious, and frankly surprising. Our team could not come up with a clear explanation for why it exists.


Complete Technical Exposition

In the following section you will see our team's entire working out start to finish. This includes inquiry questions which did not come to fruition, data cleaning and wrangling, EDA, unsuccessful analysis pathways and finally, all code to produce our findings.

Investigation 1¶

Investigate which types of medicines/diseases are getting the most funding in which areas, and compare this to the true prevalence of those diseases around the US.

In [1]:
import pandas as pd
import plotly
import plotly.express as px
pd.options.plotting.backend = "plotly"
plotly.offline.init_notebook_mode()
import numpy as np
import geopandas as gpd
import matplotlib.pyplot as plt

import os
if not os.path.exists("images"):
    os.mkdir("images")
In [2]:
research = pd.read_csv("data/citadel/Final_Datasets/Research_Payments_2020.csv", low_memory = False)

Wrangling and cleaning¶

Creating a smaller dataset¶

From this table, we're interested in finding out:

  • the prevalence of these research-related drugs in different parts of the country
  • what field of medicine the drugs are related to, and
  • how much money is involved in the transaction of these drugs

Here we create a new, smaller dataset by removing unnecessary columns.

In [3]:
columns_of_interest = ["Recipient_City", "Recipient_State", "Recipient_Zip_Code", "Recipient_Country",
                      "Recipient_Province", "Recipient_Postal_Code", 
                      "Product_Category_or_Therapeutic_Area_1",
                      "Product_Category_or_Therapeutic_Area_2", 
                      "Product_Category_or_Therapeutic_Area_3",
                      "Product_Category_or_Therapeutic_Area_4",
                      "Product_Category_or_Therapeutic_Area_5",
                      "Total_Amount_of_Payment_USDollars", "Name_of_Study", "Manufacturer_or_GPO_State"]

research_iq = research.copy()[columns_of_interest]

Note that the drugs have been purchased across multiple countries.

In [4]:
print("Countries in the research dataset:")
for i in research_iq["Recipient_Country"].unique():
    print(f"\t - {i}")
Countries in the research dataset:
	 - United States
	 - Great Britain (Uk)
	 - nan
	 - Canada
	 - South Africa
	 - Zambia

We are only interested in data regarding the United States, so we conduct this filtering now.

In [5]:
research_us = research_iq[research_iq["Recipient_Country"] == "United States"]
research_us = research_us.drop(columns = ["Recipient_Country"])
In [6]:
research_us.head()
Out[6]:
Recipient_City Recipient_State Recipient_Zip_Code Recipient_Province Recipient_Postal_Code Product_Category_or_Therapeutic_Area_1 Product_Category_or_Therapeutic_Area_2 Product_Category_or_Therapeutic_Area_3 Product_Category_or_Therapeutic_Area_4 Product_Category_or_Therapeutic_Area_5 Total_Amount_of_Payment_USDollars Name_of_Study Manufacturer_or_GPO_State
0 Boston MA 02241 NaN NaN PERIPHERAL VASCULAR NaN NaN NaN NaN 560.00 PE Study (PE) / Optalyse MA
1 Hackensack NJ 07601 NaN NaN ONCOLOGY NaN NaN NaN NaN 1.01 A Phase III, Randomized, Double-blind, Matchin... NJ
2 Doral FL 33166 NaN NaN Potassium Binder NaN NaN NaN NaN 884.26 DIAMOND PAT-CR-302 NaN
3 San Francisco CA 94115 NaN NaN NaN NaN NaN NaN NaN 5900.00 A STUDY OF SAPANISERTIB, COMBINATION OF SAPANI... IL
4 Delray Beach FL 33484 NaN NaN PERIPHERAL VASCULAR NaN NaN NaN NaN 227.50 KNOCOUT PE MA

Data cleaning¶

Filling in alternate NaN values¶

Upon brief inspection of the dataset, it is clear that there are multiple values used in place of NaN such as:

  • 'nan'
  • 'NONE'
  • 'None'
  • 'none'
  • 'NOT APPLICABLE'
  • 'Not Applicable'

Let's replace these with NaN.

In [7]:
research_us = research_us.replace(["nan", "NONE", "None", "none", "NOT APPLICABLE", "Not Applicable"], np.nan)

Checking location data exists¶

The location of where each drug has been purchased is one of the key pieces of information we need in this inquiry. Hence, it is worthwhile to ensure this data exists for each of our entries. The variables of interest here are:

  • Recipient_City
  • Recipient_State
  • Recipient_Zip_Code
  • Recipient_Province
  • Recipient_Postal_Code

Let's check how many entries do not contain this information.

In [8]:
no_city = research_us[research_us["Recipient_City"].isna()].shape[0]
no_state = research_us[research_us["Recipient_State"].isna()].shape[0]
no_zip = research_us[research_us["Recipient_Zip_Code"].isna()].shape[0]
no_province = research_us[research_us["Recipient_Province"].isna()].shape[0]
no_post_code = research_us[research_us["Recipient_Postal_Code"].isna()].shape[0]

print(f"There are {research_us.shape[0]} entries in total.")
print(f"There are {no_city} entries with no recipient city value.")
print(f"There are {no_state} entries with no recipient state value.")
print(f"There are {no_zip} entries with no zip code value.")
print(f"There are {no_province} entries with no province value.")
print(f"There are {no_post_code} entries with no post code value.")
There are 588027 entries in total.
There are 6 entries with no recipient city value.
There are 0 entries with no recipient state value.
There are 0 entries with no zip code value.
There are 588027 entries with no province value.
There are 588027 entries with no post code value.

It seems as though every single entry in our US dataset has no province or post code value. This is a sensible result — the US uses zip codes rather than post codes, and has states rather than provinces.

Also note that every entry in our dataset has both a zip code and a state associated with it. It is up to us to decide which location variable to use here; however, after examining Appendix 1, it is clear that the zip code variable is too granular to be of much use. Hence we stick with Recipient_State.

To make the data clearer for international readers, we also add a column with each state's full name rather than just its abbreviation.

In [9]:
state_names = pd.read_csv("data/usStates.csv")
state_names.drop(columns = "Abbrev", inplace = True)
state_names.head()
Out[9]:
State Code
0 Alabama AL
1 Alaska AK
2 Arizona AZ
3 Arkansas AR
4 California CA
In [10]:
research_us = research_us.merge(state_names, how = "inner", left_on = "Recipient_State", right_on = "Code")
In [11]:
# delete empty and unused columns
research_us = research_us.drop(columns = ["Recipient_City", "Recipient_Province", "Recipient_Postal_Code", "Code"])

Checking product category data exists¶

We need to correlate which type of medicinal products under research are going to which zip codes in the US. For this, we need to ensure that the Product_Category_or_Therapeutic_Area variables are usable.

In [12]:
# how many entries have no data about the product category?

empty_product_categories = research_us[
    research_us["Product_Category_or_Therapeutic_Area_1"].isna() &
    research_us["Product_Category_or_Therapeutic_Area_2"].isna() &
    research_us["Product_Category_or_Therapeutic_Area_3"].isna() &
    research_us["Product_Category_or_Therapeutic_Area_4"].isna() &
    research_us["Product_Category_or_Therapeutic_Area_5"].isna()
]
print(f"There are {empty_product_categories.shape[0]} entries with no information regarding the product category.")
print(f"This is {round((empty_product_categories.shape[0]/research_us.shape[0]) * 100, 2)}% of the dataset.")
There are 229508 entries with no information regarding the product category.
This is 39.21% of the dataset.

That's not a good sign. Over one-third of the entries in our dataset have no information regarding what was actually purchased. The only options we have to preserve data integrity are to either remove these entries completely, or try to infer the product category from another variable, such as Name_of_Study.

Let's see whether the format of Name_of_Study lends itself to being used in an NLP algorithm.

In [13]:
# print the first ten research paper names
for i in range(10):
    print(f'{i}: {research_us["Name_of_Study"][i]}')
    if i != 9: 
        print()
0: PE Study (PE) / Optalyse

1: Ph I Dose escalation w GDC 0077 in breast cancer

2: A RANDOMIZED, DOUBLE-BLIND, PLACEBO-CONTROLLED STUDY TO ASSESS THE EFFICACY, SAFETY, PHARMACOKINETICS AND PHARMACODYNAMICS OF AGN-242428 IN PATIENTS WITH PLAQUE PSORIASIS.

3: A Clinical Study to Demonstrate Safety and Efficacy of E7777 (Denileukin Diftitox) in Persistent or Recurrent Cutaneous T-Cell Lymphoma

4: A PHASE 3, RANDOMIZED, 3-PART STUDY TO INVESTIGATE THE EFFICACY AND SAFETY OF DUPILUMAB IN ADULT AND ADOLESCENT PATIENTS WITH EOSINOPHILIC ESOPHAGITIS (EOE)

5: A 24 Week Multicenter Randomized Double blind Parallel Group Placebocontrolled Study to Investigate the Effects of Saxagliptin and Sitagliptin in Patients With Type 2 Diabetes Mellitus and Heart Failure

6: BIOFLOW-VII

7: Real-World Persistence of Tocilizumab Compared With Other Subcutaneous Biologic Disease-Modifying Antirheumatic Drugs Among Patients With Rheumatoid Arthritis Switching From Another Biologic

8: KEYMAKER U01 Substudy 3 A Phase 2, Umbrella Study with Rolling Arms of Investigational Agents in Combination with Pembrolizumab in Patients with Advanced Non-Small Cell Lung Cancer NSCLC Previously Treated with anti-PD-L1 Therapy

9: ORAL BUDESONIDE SUSPENSION (OBS) IN ADOLESCENTS AND ADULT SUBJECTS WITH EOSINOPHILIC ESOPHAGITIS: DB, MAINTENANCE

A brief examination of the Name_of_Study variable shows us no common pattern from which we could reliably extract the broader field of medicine. Some names are simply study IDs (such as entry 6), some are the name of an umbrella trial program rather than the study itself (such as entry 8), and some require a pair of medical eyes to decipher what the broader category is (such as entries 2 and 9).

But we also cannot simply remove nearly 40% of our dataset. It may be worthwhile attempting an alternative research question that does not depend on the product category.
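To make concrete why Name_of_Study resists automated categorisation, here is a minimal keyword-based tagger. The keyword-to-category mapping below is our own hypothetical sketch for illustration, not part of any dataset documentation.

```python
# A naive keyword tagger for study names. The KEYWORDS mapping is a made-up
# illustrative sketch, not derived from the Open Payments data dictionary.
KEYWORDS = {
    "cancer": "ONCOLOGY",
    "lymphoma": "ONCOLOGY",
    "diabetes": "ENDOCRINOLOGY",
    "psoriasis": "DERMATOLOGY",
    "arthritis": "RHEUMATOLOGY",
}

def guess_category(study_name):
    lowered = study_name.lower()
    for keyword, category in KEYWORDS.items():
        if keyword in lowered:
            return category
    return None  # nothing matched

# Descriptive titles are easy...
print(guess_category("Ph I Dose escalation w GDC 0077 in breast cancer"))  # ONCOLOGY
# ...but bare study IDs give a keyword approach nothing to work with.
print(guess_category("BIOFLOW-VII"))  # None
```

Even with a far larger keyword list, ID-only names like BIOFLOW-VII would remain unclassifiable, which is why we pivot to an inquiry question that does not need the product category.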

Investigation 2¶

Which states are purchasing the most research-related drugs? Is there a correlation between this and the socio-economic status of that state?

Wrangling a new dataset¶

To solve this question clearly, we will need more information than what is provided in the research payments dataset. Particularly, we will require specific information regarding each state such as its population and socio-economic status.

Cleaning and aggregating research_us¶

In [14]:
# creating a new dataset without product category variables
research_states = research_us.copy()

columns_to_drop = ["Product_Category_or_Therapeutic_Area_1", "Product_Category_or_Therapeutic_Area_2",
                 "Product_Category_or_Therapeutic_Area_3", "Product_Category_or_Therapeutic_Area_4",
                 "Product_Category_or_Therapeutic_Area_5", "Name_of_Study", "Recipient_Zip_Code"]

research_states.drop(columns = columns_to_drop, inplace = True)
In [15]:
# renaming columns for convenience
research_states.rename(columns = {"Recipient_State": "state_code",
                                  "Total_Amount_of_Payment_USDollars": "payment",
                                  "State": "state"}, inplace = True)

Let's aggregate our dataset per state — but do we aggregate by the total dollars spent on prescriptions or by the number of prescriptions bought? We construct a basic scatter plot below to check how strongly the two aggregations correlate.

In [16]:
states_agg = research_states.groupby(["state", "state_code"])["payment"].agg(sum = "sum", count = "count").reset_index()
states_agg.head()
Out[16]:
state state_code sum count
0 Alabama AL 6.337052e+07 7751
1 Alaska AK 2.513318e+06 318
2 Arizona AZ 1.010216e+08 12611
3 Arkansas AR 1.957926e+07 3830
4 California CA 6.942451e+08 60180
In [17]:
fig = states_agg.plot(kind = "scatter", 
                      x = "count",
                      y = "sum",
                      log_x = True,
                      log_y = True,
                      title = "Correlation between number of prescriptions and total cost of prescriptions per state",
                      hover_data = ["state"],
                      trendline = "ols",
                      trendline_options= dict(log_x = True, log_y = True))
fig.show()

It is evident (and also logical) that there is a very strong ($R^2 = 0.97$) correlation here, so it shouldn't really matter how we aggregate it. That being said, as we will be importing data on the average household incomes of each state, it may be more interesting to retain the total dollars spent on prescriptions.
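As a sanity check on such an $R^2$ value: for an OLS trendline fitted on log-transformed axes, $R^2$ is simply the squared Pearson correlation of the logged variables. A minimal sketch with synthetic state-like data (the numbers below are simulated, not our dataset):

```python
import numpy as np

# Synthetic stand-in for states_agg (made-up numbers, not the real data):
# prescription counts per state, with totals roughly proportional to counts.
rng = np.random.default_rng(0)
count = rng.lognormal(mean=8.0, sigma=1.5, size=50)
total = count * rng.lognormal(mean=7.0, sigma=0.3, size=50)

# For a log-log OLS fit, R^2 equals the squared Pearson correlation
# of the logged variables.
r = np.corrcoef(np.log10(count), np.log10(total))[0, 1]
print(round(r ** 2, 2))
```

Because totals here are counts times a comparatively small multiplicative noise term, the log-log correlation comes out very high, mirroring what we see in the real scatter plot.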

In [18]:
# deleting count column
states_agg.drop(columns = "count", inplace = True)

Importing demographic data by state¶

The data here is taken from the DP03 table of the 2017 American Community Survey 5-year estimates. Note that this dataset is now five years old — the exact figures will not be completely accurate. However, they can still provide us with good, indicative insights.

In [19]:
us_demographics = pd.read_csv("data/usDemo.csv")
# selecting appropriate variables
us_demo = us_demographics.copy()[['State', 'TotalPop', 'Men', 'Women', 'Hispanic', 'White', 'Black', 
                  'Native', 'Asian', 'Pacific', 'VotingAgeCitizen', 'Income',
                  'IncomePerCap', 'Poverty', 'ChildPoverty', 'Unemployment']]

Note that many of these variables are in percentages, and so will return meaningless results upon aggregation. To circumvent this, we first convert these columns back into raw numbers.

In [20]:
columns_in_percentages = ['Hispanic', 'White', 'Black', 'Native', 'Asian', 'Pacific', 'Poverty',
                          'ChildPoverty', 'Unemployment']
for column in columns_in_percentages:
    us_demo[f"{column}_Count"] = us_demo[column] * us_demo["TotalPop"] / 100
    
us_demo.drop(columns = columns_in_percentages, inplace = True)

All variables are now raw values. Upon aggregation per state, we sum() all variables besides Income and IncomePerCap, which we mean() instead.

In [21]:
agg_dict = {'TotalPop': 'sum', 'Men': 'sum', 'Women': 'sum', 'Hispanic_Count': 'sum', 
            'White_Count': 'sum', 'Black_Count': 'sum', 'Native_Count': 'sum', 'Asian_Count': 'sum', 
            'Pacific_Count': 'sum', 'VotingAgeCitizen': 'sum', 
            'Income': 'mean', 'IncomePerCap': 'mean', 
            'Poverty_Count': 'sum', 'ChildPoverty_Count': 'sum', 'Unemployment_Count': 'sum'}

demograph_state = us_demo.groupby("State").agg(agg_dict).reset_index()
demograph_state.rename(columns = {"Income": "AvgIncome", "IncomePerCap": "AvgIncomePerCap"}, inplace = True)
In [22]:
# Now we convert these columns back into percentages.

for column in columns_in_percentages:
    demograph_state[f"{column}_%"] = (demograph_state[f"{column}_Count"] / demograph_state["TotalPop"]) * 100
    #demograph_state.drop(columns = f"{column}_Count", inplace = True)

Merging our two datasets¶

We currently have two datasets:

  • states_agg: the total amount of money spent on research-related prescriptions per state
  • demograph_state: socio-economic data per state

Merging these will provide us with a complete, unified dataset we can use to conduct analysis.

In [23]:
states = demograph_state.merge(states_agg, left_on = "State", right_on = "state")
states.drop(columns = "state", inplace = True)
states.rename(columns = {"sum": "Total_Spent", "state_code": "State_Code"}, inplace = True)
states.head(1)
Out[23]:
State TotalPop Men Women Hispanic_Count White_Count Black_Count Native_Count Asian_Count Pacific_Count ... White_% Black_% Native_% Asian_% Pacific_% Poverty_% ChildPoverty_% Unemployment_% State_Code Total_Spent
0 Alabama 4850771 2350806 2499965 198309.036 3198115.149 1281023.929 22598.225 62034.237 1453.847 ... 65.930038 26.408666 0.465869 1.278853 0.029971 18.166942 25.634724 7.750819 AL 63370519.18

1 rows × 27 columns

Contextualising the inquiry question¶

Proliferation of research-related prescriptions in the US (heatmap)¶

Note: Uslaner writes in The Pitfalls of Per Capita,

...per capita indices, unlike standardized scores are nonlinear transformations of the original data...difficulties arise when per capita measures are employed in techniques encompassed in the general linear model.

Hence we refrain from using per capita indices in the context of "amount paid towards research-related prescriptions per capita."

Let's now create a choropleth map of each state's total spending on research-related prescriptions.

In [24]:
choro_finding1 = px.choropleth(states, 
                               locations = "State_Code", 
                               color = "Total_Spent",
                               color_continuous_scale = "blues",
                               locationmode = "USA-states", 
                               scope = "usa", 
                               hover_name = "State",
                               hover_data = ["TotalPop", "Total_Spent"],
                               labels = {"State_Code": "State Code",
                                         "Total_Spent": "Total spending ($)",
                                         "TotalPop": "Total population"},
                               title = "Amount of money spent on research-related prescriptions")

box_finding1 = px.box(states, 
             x = "Total_Spent", 
             points = "all",
             hover_data = ["State"],
             title = "Distribution of research-related prescription purchases per state",
             color_discrete_sequence = ["#2A6AAF"]
            )

choro_finding1.show()
choro_finding1.write_image("images/finding1.1.svg")
box_finding1.show()
box_finding1.write_image("images/finding1.2.svg")

It's clear from both the choropleth map and the boxplot that California, Texas, Florida, New York, Massachusetts and North Carolina are all states of interest. Why, out of all American states, do these six have the most research-related prescription purchases? What makes them different from the other states? We also need to keep the other end of the spectrum in mind — why does a state like Wyoming not even spend $1M on research-related products?

Analysing our dataset¶

Finding simple correlations between variables¶

Here we try to find correlations between pairs of variables. Refer to Appendix 2 for our reasoning as to why we use the _% variables here.

In [25]:
states_pc_only = states[['State', 'TotalPop', 'Men', 'Women', 'Hispanic_%', 'White_%',
                         'Black_%', 'Native_%', 'Asian_%', 'Pacific_%', 'Poverty_%',
                         'ChildPoverty_%', 'Unemployment_%', 'State_Code', 'Total_Spent',
                         'AvgIncome', 'AvgIncomePerCap']]

corr_mat_pc = states_pc_only.corr(numeric_only = True)  # skip the non-numeric State and State_Code columns
heatmap_pc = px.imshow(corr_mat_pc)
heatmap_pc.show()

Let's examine the correlation between Unemployment_% and Total_Spent.

In [26]:
finding1 = px.scatter(states, 
                 x = "Unemployment_%",
                 y = "Total_Spent", 
                 hover_data = ["State"],
                 trendline = "ols",
                 title = "Spending on research-related prescriptions vs unemployment")
finding1.show()
finding1.write_image("images/finding1.svg")

The data would seem to follow a strong positive trend if it were not for the outliers; unsurprisingly, once again we see California, Texas, Florida, New York, Massachusetts and North Carolina being the outliers in this graph.

There is something different about these 6 states.

Is there a correlation between these 6 states and the most active pharma states?¶

In [27]:
pharma_states = research_us.groupby("Manufacturer_or_GPO_State")["Name_of_Study"].count().reset_index()
pharma_states.rename(columns = {"Name_of_Study": "Number_of_prescriptions"}, inplace = True)
In [28]:
pharma_states.head()
Out[28]:
Manufacturer_or_GPO_State Number_of_prescriptions
0 AZ 692
1 CA 60436
2 CO 3046
3 CT 2146
4 DC 1127
In [29]:
choro = px.choropleth(pharma_states, 
                      locations = "Manufacturer_or_GPO_State", 
                      color = "Number_of_prescriptions",
                      color_continuous_scale = "reds",
                      locationmode = "USA-states", 
                      scope = "usa",
                      title = "States with the most active pharmaceutical manufacturers")

choro.show()

No.

Is there a correlation between these 6 states and where most investment into pharma companies comes from?¶

In [30]:
investment = pd.read_csv("data/citadel/Final_Datasets/Ownership_Investment_2020.csv")

Variables of interest:

  • Recipient_State
  • Value_of_Interest
In [31]:
pharma_investments = investment.groupby("Recipient_State")["Value_of_Interest"].sum().reset_index()
pharma_investments.rename(columns = {"Recipient_State": "State_Code"}, inplace = True)
In [32]:
choro = px.choropleth(pharma_investments, 
                      locations = "State_Code", 
                      color = "Value_of_Interest",
                      color_continuous_scale = "greens",
                      locationmode = "USA-states", 
                      scope = "usa",
                      title = "Sources of investment into the pharmaceutical industry")

choro.show()

It seems that most investment into pharmaceutical companies comes from California.

Could this just be a function of per capita income, i.e. California is simply a rich state, leading to more investment in business?

In [33]:
pharma_investments = pharma_investments.merge(states)
In [34]:
fig = px.scatter(pharma_investments, 
                 x = "AvgIncomePerCap",
                 y = "Value_of_Interest", 
                 hover_data = ["State"],
                 trendline = "ols",
                 title = "Do rich people invest into pharmaceuticals more?")
fig.show()

Clearly not. It seems California has an excessive amount of investment into pharmaceutical companies compared to its average income per capita. Ohio also fits into this category.

This could very well be a future analysis path to take: why do Californians and Ohioans invest so readily into pharmaceuticals?

Is there a correlation between these 6 states and the states with the most hospitals?¶

To investigate this correlation, we will need a dataset relating to the number of hospitals in each state. The Homeland Infrastructure Foundation-Level Data (HIFLD) website has such a dataset.

In [35]:
hospitals = pd.read_csv("data/Hospitals.csv")
In [36]:
hospitals.columns
Out[36]:
Index(['X', 'Y', 'OBJECTID', 'ID', 'NAME', 'ADDRESS', 'CITY', 'STATE', 'ZIP',
       'ZIP4', 'TELEPHONE', 'TYPE', 'STATUS', 'POPULATION', 'COUNTY',
       'COUNTYFIPS', 'COUNTRY', 'LATITUDE', 'LONGITUDE', 'NAICS_CODE',
       'NAICS_DESC', 'SOURCE', 'SOURCEDATE', 'VAL_METHOD', 'VAL_DATE',
       'WEBSITE', 'STATE_ID', 'ALT_NAME', 'ST_FIPS', 'OWNER', 'TTL_STAFF',
       'BEDS', 'TRAUMA', 'HELIPAD', 'GlobalID'],
      dtype='object')

Variables of interest:

  • TYPE:
    • GENERAL ACUTE CARE
    • PSYCHIATRIC
    • CHILDREN
    • LONG TERM CARE
    • CRITICAL ACCESS
    • REHABILITATION
    • MILITARY
    • WOMEN
    • SPECIAL
    • CHRONIC DISEASE
  • STATUS:
    • CLOSED
    • OPEN
  • STATE
  • NAICS_DESC

We first filter out any closed hospitals before performing our aggregate by state.

In [37]:
open_hospitals = hospitals[hospitals["STATUS"] == "OPEN"]
In [38]:
hospitals_state = open_hospitals.groupby(["STATE", "TYPE"])["ID"].count().reset_index()
hospitals_state.rename(columns = {"STATE": "State_Code", 
                                  "TYPE": "Type", 
                                  "ID": "Hospital_Count"}, inplace = True)
hospitals_state = hospitals_state.merge(states)
hospitals_state.head()
Out[38]:
State_Code Type Hospital_Count State TotalPop Men Women Hispanic_Count White_Count Black_Count ... Hispanic_% White_% Black_% Native_% Asian_% Pacific_% Poverty_% ChildPoverty_% Unemployment_% Total_Spent
0 AK CRITICAL ACCESS 90 Alaska 738565 386319 352246 50196.915 454500.377 22572.532 ... 6.796547 61.538304 3.056269 13.768733 6.057875 1.190426 10.189025 13.439155 8.157018 2513318.04
1 AK GENERAL ACUTE CARE 126 Alaska 738565 386319 352246 50196.915 454500.377 22572.532 ... 6.796547 61.538304 3.056269 13.768733 6.057875 1.190426 10.189025 13.439155 8.157018 2513318.04
2 AK LONG TERM CARE 9 Alaska 738565 386319 352246 50196.915 454500.377 22572.532 ... 6.796547 61.538304 3.056269 13.768733 6.057875 1.190426 10.189025 13.439155 8.157018 2513318.04
3 AK MILITARY 36 Alaska 738565 386319 352246 50196.915 454500.377 22572.532 ... 6.796547 61.538304 3.056269 13.768733 6.057875 1.190426 10.189025 13.439155 8.157018 2513318.04
4 AK PSYCHIATRIC 27 Alaska 738565 386319 352246 50196.915 454500.377 22572.532 ... 6.796547 61.538304 3.056269 13.768733 6.057875 1.190426 10.189025 13.439155 8.157018 2513318.04

5 rows × 29 columns

In [39]:
fig = px.scatter(hospitals_state, 
                 x = "Hospital_Count",
                 y = "Total_Spent", 
                 facet_col = "Type",
                 facet_col_wrap = 4,
                 hover_data = ["State"],
                 trendline = "ols",
                 title = "Research-related prescription spending vs number of hospitals per state, per hospital type")
fig.show()

It appears the strongest correlation occurs between general acute care hospitals and military hospitals. Let's enlarge these plots.

In [40]:
general_hospitals_state = hospitals_state[hospitals_state["Type"] == "GENERAL ACUTE CARE"]
military_hospitals_state = hospitals_state[hospitals_state["Type"] == "MILITARY"]
In [41]:
general_fig = px.scatter(general_hospitals_state, 
                         x = "Hospital_Count",
                         y = "Total_Spent", 
                         hover_data = ["State"],
                         trendline = "ols",
                         title = "Prescription spending vs number of general care hospitals per state")

military_fig = px.scatter(military_hospitals_state, 
                          x = "Hospital_Count",
                          y = "Total_Spent", 
                          hover_data = ["State"],
                          trendline = "ols",
                          title = "Prescription spending vs number of military hospitals per state")

general_fig.show()
general_fig.write_image("images/finding2.1.svg")
military_fig.show()
military_fig.write_image("images/finding2.2.svg")

Considering that fewer than 1 in 5 entries in the original dataset were related to a teaching hospital (see pie chart below), the first correlation is less obvious than it may first appear. The increase in research-related payments alongside an increase in general-care hospitals may be attributed to the fact that hospitals tend to be the first to trial new or experimental drugs.

In [42]:
pie_finding2 = px.pie(research,
             "Covered_Recipient_Type",
             title = "Distribution of recipient types for research-related drugs")

pie_finding2.show()
pie_finding2.write_image("images/finding2.3.svg")

The second correlation is far from obvious, and frankly surprising.

Appendix 1: Research-related prescriptions across the US by zip code¶

Cleaning up zip code formatting¶

A United States zip code is either 5 digits, or 5+4 digits with the two parts separated by a hyphen. The extra four digits simply identify a specific delivery route, and are a hindrance to our analysis.

In [43]:
# only keeping the first 5 digits of the zip code
research_us["Recipient_Zip"] = research_us["Recipient_Zip_Code"].str.split("-").str[0]
research_us = research_us.drop(columns = "Recipient_Zip_Code")
In [44]:
print(f"Min length: {research_us['Recipient_Zip'].str.len().min()}")
print(f"Max length: {research_us['Recipient_Zip'].str.len().max()}")
Min length: 5
Max length: 5

Producing the heatmap¶

In [45]:
per_zipcode = research_us.groupby("Recipient_Zip")["Total_Amount_of_Payment_USDollars"].sum().reset_index()
per_zipcode.head()
Out[45]:
Recipient_Zip Total_Amount_of_Payment_USDollars
0 01003 38944.00
1 01035 179543.38
2 01062 1750.00
3 01104 29431.20
4 01105 78544.82
In [46]:
us_zip_map = gpd.read_file("data/zip_codes/tl_2019_us_zcta510.shp")
us_zip_map.shape
Out[46]:
(33144, 10)
In [47]:
map_and_zip = us_zip_map.merge(per_zipcode, left_on = "ZCTA5CE10", right_on = "Recipient_Zip")
In [48]:
fig, ax = plt.subplots(1, figsize = (20, 10))

xlim = ([us_zip_map.total_bounds[0] + 30, us_zip_map.total_bounds[2]-200])
ylim = ([us_zip_map.total_bounds[1] + 30, us_zip_map.total_bounds[3]-10])
ax.set_xlim(xlim)
ax.set_ylim(ylim)

heatmap = map_and_zip.plot(column = "Total_Amount_of_Payment_USDollars", 
                  ax = ax,
                  legend = True,
                  figsize = (15, 10),
                  missing_kwds = {"color": "grey"},
                  legend_kwds = {"label": "Total money spent on prescriptions related to a research program",
                                 "orientation": "vertical"})
heatmap.set_axis_off()
heatmap.set_title("Proliferation of research-related prescriptions across the US", fontweight = "bold", fontsize = 20)
Out[48]:
Text(0.5, 1.0, 'Proliferation of research-related prescriptions across the US')

It is clear that analysing this dataset by zip code is far too granular to achieve any proper results. The dataset is simply not big enough for the over thirty thousand individual zip codes across the US — we will have to look at this data per state.
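A back-of-envelope calculation using the figures printed above makes the granularity problem concrete:

```python
# Figures taken from earlier outputs: In [8] reported 588,027 US entries,
# and Out[46] showed 33,144 ZCTAs in the shapefile.
n_records = 588_027
n_zctas = 33_144

records_per_zip = n_records / n_zctas
print(round(records_per_zip, 1))  # fewer than 20 records per zip code on average
```

At under 20 payment records per zip code on average (with many zip codes receiving none at all), per-zip aggregates are far too noisy; aggregating over roughly 50 states instead gives on the order of ten thousand records per group.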

Appendix 2: Using visualisations to conclude false results¶

Suppose we used the _Count variables to conduct correlation analysis rather than _% variables. In doing so, we could present many (flawed) arguments to our readers.

For example, it would seem that Unemployment_Count increases in parallel with Black_Count.

In [49]:
fig = px.scatter(states, 
                 x = "Black_Count",
                 y = "Unemployment_Count", 
                 hover_data = ["State"],
                 size = "TotalPop",
                 trendline = "ols",
                 title = "Truly, correlation does not equal causation")
fig.show()

But when the total population of each state is also plotted (represented here by the size of the bubble), we can see that the increase in unemployment counts is likely a function of the increase in total population, which is the confounding variable here.

Hence, we should use _% variables rather than _Count variables here.
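The confounding effect can be reproduced with synthetic data: two rates that are genuinely independent become strongly correlated once each is scaled by a shared population size. All numbers below are simulated, not taken from our dataset.

```python
import numpy as np

# Simulated data: population sizes plus two rates drawn independently of each
# other and of population, so the "true" correlation between the rates is ~0.
rng = np.random.default_rng(1)
pop = rng.lognormal(mean=15.0, sigma=1.0, size=52)   # state-like populations
rate_a = rng.uniform(0.02, 0.30, size=52)            # analogous to a _% variable
rate_b = rng.uniform(0.03, 0.10, size=52)            # analogous to Unemployment_%

corr_rates = np.corrcoef(rate_a, rate_b)[0, 1]               # near zero
corr_counts = np.corrcoef(rate_a * pop, rate_b * pop)[0, 1]  # inflated by pop
print(f"rates: {corr_rates:.2f}, counts: {corr_counts:.2f}")
```

Dividing out the shared population factor is exactly what the _% variables do, which is why they are the right inputs for the correlation heatmap.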